Abstract. In Bayesian machine learning, conjugate priors are popular,
mostly due to mathematical convenience. In this paper, we show that
there are deeper reasons for choosing a conjugate prior. Specifically, we
formulate the conjugate prior in the form of Bregman divergence and
show that it is the inherent geometry of conjugate priors that makes
them appropriate and intuitive. This geometric interpretation allows one
to view the hyperparameters of conjugate priors as the effective sample
points, thus providing additional intuition. We use this geometric
understanding of conjugate priors to derive the hyperparameters and
expression of the prior used to couple the generative and discriminative
components of a hybrid model for semi-supervised learning.

Keywords: Bregman divergence, conjugate prior, exponential families,
generative models.

1 Introduction

In probabilistic modeling, a practitioner typically chooses a likelihood function
(model) based on her knowledge of the problem domain. With limited training
data, a simple maximum likelihood (ML) estimation of the parameters of this
model will lead to overfitting and poor generalization. One can regularize the
model by adding a prior, but the fundamental question is: which prior? We give 
a turn-key answer to this problem by analyzing the underlying geometry of the 
likelihood model and suggest choosing the unique prior with the same geometry 
as the likelihood. This unique prior turns out to be the conjugate prior, in the 
case of the exponential family. This provides justification beyond "computational 
convenience" for using the conjugate prior in machine learning and data mining 
applications. 
In this work, we give a geometric understanding of the maximum likelihood
estimation method and a geometric argument in favor of using conjugate priors.
In Section 4.1, first we formulate the ML estimation problem into a completely 
geometric problem with no explicit mention of probability distributions. We then 
show that this geometric problem carries a geometry that is inherent to the struc- 
ture of the likelihood model. For reasons given in Sections 4.3 and 4.4, when con- 
sidering the prior, it is important that one uses the same geometry as likelihood. 



2 Arvind Agarwal, Hal Daume III 

Using the same geometry also gives the closed-form solution for the maximum-a-
posteriori (MAP) problem. We then analyze the prior using concepts borrowed
from information geometry. We show that this geometry induces the Fisher
information metric and 1-connection, which are, respectively, the natural metric
and connection for the exponential family (Section 5). One important outcome
of this analysis is that it allows us to treat the hyperparameters of the conjugate
prior as the effective sample points drawn from the distribution under
consideration. This analysis also allows us to extend the results of MAP estimation in the
exponential family to the $\alpha$-family (Section 5.1) because, similar to exponential
families, $\alpha$-families also carry an inherent geometry [12]. We finally extend this
geometric interpretation of conjugate priors to analyze the hybrid model given
by [8] in a purely geometric setting and justify the argument presented in [1] (i.e.,
a coupling prior should be conjugate) using a much simpler analysis (Section 6).
Our analysis couples the discriminative and generative components of the hybrid
model using the Bregman divergence, which reduces to the coupling prior given
in [1]. This analysis avoids the explicit derivation of the hyperparameters; rather,
it automatically gives the hyperparameters of the conjugate prior along with the
expression.



2 Motivation 

Our analysis is driven by the desire to understand the geometry of the conjugate 
priors for the exponential families. This understanding has many advantages that 
are described in the remainder of the paper: an extension of the notion of conjugacy
beyond the exponential family (to the $\alpha$-family), and a geometric analysis of models
that use conjugate priors [1].

We motivate our analysis by asking ourselves the following question: Given
a parametric model $p(x; \theta)$ for the data likelihood, and a prior on its parameters
$\theta$, $p(\theta; \alpha, \beta)$, what should the hyperparameters $\alpha$ and $\beta$ of the prior encode?
We know that $\theta$ in the likelihood model is the estimate of the parameter
obtained from the given data points. In other words, the estimated parameter fits the
model to the given data, while the prior on the parameter provides the
generalization. This generalization is enforced by some prior belief encoded in
the hyperparameters. Unfortunately, one does not know the likely value
of the parameters; rather, one might have some belief about which data points are
likely to be sampled from the model. Now the question is: Do the hyperparameters
encode this belief about the parameters in terms of the sampling points? Our
analysis shows that the hyperparameters of the conjugate prior are nothing but
the effective sampling points. In the case of non-conjugate priors, the interpretation of
the hyperparameters is not clear.

A second motivation is the following geometric analysis. Before we go into
the problem, consider two points in Euclidean space which one would like to
interpolate using a parameter $\gamma \in [0, 1]$. A natural way to do so is to interpolate
them linearly, i.e., connect the two points using a straight line, and then find the
interpolating point at the desired $\gamma$, as shown in Figure 1(a). This interpolation



A Geometric View of Conjugate Priors 





Fig. 1. Interpolation of two points a and b using (a) Euclidean geometry, and (b) non-
Euclidean geometry. Here geometry is defined by the respective distance/divergence
functions $d_e$ and $d_g$. It is important to notice that the divergence is a generalized notion of
distance in non-Euclidean spaces, in particular, in the spaces of exponential family
statistical manifolds. In these spaces, it is the divergence function that defines the geometry.



scheme does not change if we move to a non-Euclidean space. In other words, if
we were to interpolate two points in a non-Euclidean space, we would find the
interpolating point by connecting the two points by a geodesic (the equivalent of
a straight line in a non-Euclidean space) and then finding the point at the
desired $\gamma$, shown in Figure 1(b).

This situation arises when one has two models and wants to build a better 
model by interpolating them. This exact situation is encountered in [8] where the 
objective is to build a hybrid model by interpolating (or coupling) discriminative
and generative models. Agarwal et al. [1] couple these two models using the
conjugate prior, and empirically show that using a conjugate prior for the coupling
outperforms the original choice [8] of a Gaussian prior. In this work, we find
the hybrid model by interpolating the two models using the inherent geometry^1
of the space (interpolate along the geodesic in the space defined by the inher- 
ent geometry) which automatically results in the conjugate prior along with its 
hyperparameters. Our analysis and the analysis of Agarwal et al. lead to the 
same result, but ours is much simpler and naturally extends to the cases where 
one wants to couple more than two models. One big advantage of our analysis 
is that unlike prior approaches [1], we need not know the expression and the 
hyperparameters of the prior in advance. They are automatically derived by the 
analysis. Our analysis based on the geometric interpretation can also be used to 
interpolate the models using a polynomial of higher degree instead of just the 
straight line i.e., quadratic interpolation etc., and to derive the corresponding 
prior. Our analysis only requires the inherent geometry which is given by the 
models under the consideration and the interpolation parameters (parameters 
of the polynomial). No explicit expression of the coupling prior is needed. 



^1 In an exponential family statistical manifold, the inherent geometry is defined by the
divergence function because it is the divergence function that induces the metric structure
and connection of the manifold. Refer to [2] for more details.

Fig. 2. Duality between mean parameters and natural parameters. Notice the convex
functions defined over both spaces; these functions are duals of each other and so are the spaces.



3 Background 

In this section, we give the required background; specifically, we revisit the concepts
related to Legendre duality, exponential families and Bregman divergence.

3.1 Legendre Duality 

Let $M \subseteq \mathbb{R}^d$ and $\Theta \subseteq \mathbb{R}^d$ be two spaces and let $F : M \to \mathbb{R}^+$ and $G : \Theta \to \mathbb{R}^+$
be two convex functions. $F$ and $G$ are said to be conjugate duals of each other
if:

$$F(\mu) = \sup_{\theta \in \Theta} \{ \langle \mu, \theta \rangle - G(\theta) \} \tag{1}$$

here $\langle a, b \rangle$ denotes the dot product of vectors $a$ and $b$. The spaces ($\Theta$ and $M$)
associated with these dual functions are called dual spaces. We sometimes use the
standard notation to refer to this duality, i.e., $G = F^*$ and $F = G^*$. A particularly
important connection between dual spaces is that for each $\mu \in M$, $\nabla F(\mu) = \theta \in \Theta$
(denoted as $\mu^* = \theta$) and similarly, for each $\theta \in \Theta$, $\nabla G(\theta) = \mu \in M$ (or
$\theta^* = \mu$). For more details, refer to [9]. Figure 2 gives a pictorial representation
of this duality and the notations associated with it.



3.2 Bregman Divergence 

We now give a brief overview of Bregman divergence (for more details see [3]).
Let $F : M \to \mathbb{R}$ be a continuously differentiable, real-valued and strictly convex
function defined on a closed convex set $M$. The Bregman divergence associated
with $F$ for points $p, q \in M$ is:

$$B_F(p\|q) = F(p) - F(q) - \langle \nabla F(q), (p - q) \rangle \tag{2}$$

If $G$ is the conjugate dual of $F$ then:

$$B_F(p\|q) = B_G(q^*\|p^*) \tag{3}$$

here $p^*$ and $q^*$ are the duals of $p$ and $q$ respectively. It is emphasized that the Bregman
divergence is not symmetric, i.e., in general, $B_F(p\|q) \neq B_F(q\|p)$; therefore
the direction in which these divergences are measured matters.
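As a numerical illustration of (2) and (3), the following minimal sketch uses one concrete dual pair chosen here for convenience (it is an assumption of this example, not a choice made in the text at this point): the negative Bernoulli entropy as $F$ and the log-partition function $\log(1 + e^\theta)$ as $G$. The helper names and the test points 0.2 and 0.7 are hypothetical.

```python
import math

def F(mu):
    """Negative Bernoulli entropy, defined for 0 < mu < 1 (assumed example)."""
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

def grad_F(mu):
    """Gradient of F: the logit, which maps mu to its dual theta."""
    return math.log(mu / (1 - mu))

def G(theta):
    """Conjugate dual of F: the Bernoulli log-partition function."""
    return math.log1p(math.exp(theta))

def grad_G(theta):
    """Gradient of G: the sigmoid, which maps theta back to mu."""
    return 1.0 / (1.0 + math.exp(-theta))

def bregman(f, grad_f, p, q):
    """Scalar Bregman divergence B_f(p || q), following equation (2)."""
    return f(p) - f(q) - grad_f(q) * (p - q)

p, q = 0.2, 0.7                          # hypothetical points in M
p_star, q_star = grad_F(p), grad_F(q)    # their duals in Theta

primal = bregman(F, grad_F, p, q)            # B_F(p || q)
dual = bregman(G, grad_G, q_star, p_star)    # B_G(q* || p*), arguments swapped
reverse = bregman(F, grad_F, q, p)           # B_F(q || p), generally different

print(primal, dual, reverse)
```

For this pair the first two printed values should agree, as (3) states, while the third differs, which illustrates the asymmetry noted above.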
3.3 Exponential Family

In this section, we review the exponential family. The exponential family is a
set of distributions whose probability density function^2 can be expressed in the
following form:

$$p(x; \theta) = p_0(x) \exp(\langle \theta, \phi(x) \rangle - G(\theta)) \tag{4}$$

here $\phi(x) : X^m \to \mathbb{R}^d$ is a vector of potentials or sufficient statistics and $G(\theta)$ is a
normalization constant or log-partition function which ensures that distributions
are normalized. With the potential functions $\phi(x)$ fixed, every $\theta$ induces a particular
member $p(x; \theta)$ of the family. In our framework, we deal with exponential
families that are regular and have the minimal representation [11].

The exponential family has a number of convenient properties and subsumes
many common distributions. It includes the Gaussian, Binomial, Beta, Multinomial
and Dirichlet distributions, hidden Markov models, Bayes nets, etc. One
important property of the exponential family is the existence of conjugate priors.
Given any member of the exponential family in (4), the conjugate prior is
a distribution over its parameters with the following form:

$$p(\theta | \alpha, \beta) = m(\alpha, \beta) \exp(\langle \theta, \alpha \rangle - \beta G(\theta)) \tag{5}$$

here $\alpha$ and $\beta$ are hyperparameters of the conjugate prior. Importantly, the function
$G(\cdot)$ is the same between the exponential family member and its conjugate
prior.

A second important property of an exponential family member is that the
log-partition function $G$ is convex and defined over the convex set $\Theta := \{\theta \in \mathbb{R}^d :
G(\theta) < \infty\}$. Since the log-partition function $G$ is convex over this set, it induces
a Bregman divergence on the space $\Theta$.

Another important property of the exponential family is the one-to-one mapping
between the canonical parameters $\theta$ and the so-called "mean parameters",
which we denote by $\mu$. For each canonical parameter $\theta \in \Theta$, there exists a mean
parameter $\mu$, which belongs to the space $M$ defined as:

$$M := \left\{ \mu \in \mathbb{R}^d : \mu = \int \phi(x)\, p(x; \theta)\, dx \quad \forall \theta \in \Theta \right\} \tag{6}$$

Our notation has been deliberately suggestive. $\Theta$ and $M$ are dual spaces, in
the sense of Legendre duality, because of the following relationship between the
log-partition function $G(\theta)$ and the expected value of the sufficient statistics:

$$\nabla G(\theta) = E(\phi(x)) = \mu. \tag{7}$$

In Legendre duality, we know that two spaces $\Theta$ and $M$ are duals of each other
if for each $\theta \in \Theta$, $\nabla G(\theta) = \mu \in M$. Here $G$ (the log-partition function of the
exponential family distribution) is the function defined on the space $\Theta$. We call
the function in the dual space $M$ $F$, i.e., $F = G^*$. A pictorial representation
of the duality between the canonical parameter space $\Theta$ and the mean parameter space
$M$ is given in Figure 2.

^2 "Density function" can be replaced by "mass function" in the case of discrete random
variables.
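Relation (7) can be checked numerically on a concrete family. The sketch below assumes the Poisson distribution written in the form (4), with $p_0(x) = 1/x!$, $\phi(x) = x$ and $G(\theta) = e^\theta$; the family, the truncation point of the sum and the value of $\theta$ are choices made for this illustration only.

```python
import math

def G(theta):
    """Log-partition function of the Poisson family, G(theta) = exp(theta)."""
    return math.exp(theta)

def pmf(x, theta):
    """Poisson mass in the form (4), with p0(x) = 1/x! and phi(x) = x."""
    return math.exp(theta * x - G(theta) - math.lgamma(x + 1))

def expected_statistic(theta, x_max=200):
    """E[phi(x)], with the infinite sum truncated at x_max (an approximation)."""
    return sum(x * pmf(x, theta) for x in range(x_max + 1))

def numerical_gradient(f, t, h=1e-5):
    """Central finite-difference approximation of f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)

theta = 0.8                                      # hypothetical canonical parameter
mu_from_gradient = numerical_gradient(G, theta)  # left-hand side of (7)
mu_from_expectation = expected_statistic(theta)  # right-hand side of (7)
total_mass = sum(pmf(x, theta) for x in range(201))

print(mu_from_gradient, mu_from_expectation, total_mass)
```

Both estimates of the mean parameter should agree up to finite-difference error, and the total mass should be close to one, which confirms that $G$ normalizes the family in this example.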



4 Likelihood, Prior and Geometry 

In this section, we first formulate the ML problem into a Bregman median prob- 
lem (Section 4.1) and then show that corresponding MAP problem can also be 
converted into a Bregman median problem (Section 4.3). The MAP Bregman 
median problem consists of two parts: a likelihood model and a prior. We argue 
(Sections 4.3 and 4.4) that a Bregman median problem makes sense only when 
both of these parts have the same geometry. Having the same geometry amounts 
to having the same log-partition function leading to the property of conjugate 
priors. 



4.1 Likelihood in the form of Bregman Divergence 

Following [5], we can write the distributions belonging to the exponential family
in terms of Bregman divergence. Let $p(x; \theta)$ be the exponential family distribution
as defined in (4), the log of which (likelihood) can be written as^3:

$$\log p(x; \theta) = \log p_0(x) + F(x) - B_F(x\|\nabla G(\theta)) \tag{8}$$

This relationship depends on two observations: $F(\nabla G(\theta)) + G(\theta) = \langle \nabla G(\theta), \theta \rangle$
and $\nabla F(\theta) = (\nabla G)^{-1}(\theta) \Rightarrow (\nabla F)(\nabla G(\theta)) = \theta$. These two observations can be
used with (2) to see that (8) is equivalent to the probability distribution defined
in (4). This representation of the likelihood in the form of Bregman divergence
gives insight into the geometry of the likelihood function. Gaining insight into
the exponential family distributions and establishing a meaningful relationship
between likelihood and prior is the primary objective of this work.
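To make the step from (4) to (8) explicit, the following short derivation expands the divergence using (2) and the two observations stated above. It is a sketch that writes $\mu = \nabla G(\theta)$ and uses $x$ in place of $\phi(x)$.

```latex
\begin{align*}
B_F(x \| \mu) &= F(x) - F(\mu) - \langle \nabla F(\mu),\, x - \mu \rangle \\
              &= F(x) - F(\mu) - \langle \theta, x \rangle + \langle \theta, \mu \rangle
                 && \text{since } \nabla F(\mu) = \theta \\
              &= F(x) - \langle \theta, x \rangle + G(\theta)
                 && \text{since } F(\mu) + G(\theta) = \langle \mu, \theta \rangle
\end{align*}
```

Hence $F(x) - B_F(x\|\mu) = \langle \theta, x \rangle - G(\theta)$, which is exactly the exponent in (4).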

In learning problems, one is interested in estimating the parameters $\theta$ of
the model which results in low generalization error. Perhaps the most standard
estimation method is maximum likelihood (ML). The ML estimate, $\theta_{ML}$, of a
set of $n$ i.i.d. training data points $X = \{x_1, \ldots, x_n\}$ drawn from the exponential
family is obtained by solving the following problem:

$$\theta_{ML} = \arg\max_{\theta \in \Theta} \log p(X; \theta)$$

Theorem 1. Let $X = \{x_1, \ldots, x_n\}$ be a set of $n$ i.i.d. training data points drawn
from the exponential family distribution with the log-partition function $G$, and let $F$ be
the dual function of $G$. Then the dual of the ML estimate ($\theta_{ML}$) of $X$ under the assumed
exponential family model solves the following Bregman median problem:

$$\mu_{ML} = \arg\min_{\mu \in M} \sum_{i=1}^{n} B_F(x_i\|\mu) \tag{9}$$

^3 For simplicity of notation we will use $x$ instead of $\phi(x)$, assuming that $x \in \mathbb{R}^d$.
This does not change the analysis.



Proof. The log-likelihood of $X$ under the assumed exponential family distribution
is given by $\log p(X; \theta) = \sum_{i=1}^{n} \log p(x_i; \theta)$, which along with (8) can be used to
compute $\theta_{ML}$:

$$\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \left( \log p_0(x_i) + F(x_i) - B_F(x_i\|\nabla G(\theta)) \right) = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} B_F(x_i\|\nabla G(\theta))$$

which using (7) gives the desired result.

The above theorem converts the problem of maximizing the log-likelihood $\log p(X; \theta)$
into an equivalent problem of minimizing the corresponding Bregman divergences,
which is nothing but a Bregman median problem, the solution to which
is given by $\mu_{ML} = \frac{1}{n} \sum_{i=1}^{n} x_i$. The ML estimate $\theta_{ML}$ can now be computed using (7):
$\theta_{ML} = (\nabla G)^{-1}(\mu_{ML})$.
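The Bregman median view of ML estimation can be checked on a small example. The sketch below assumes a Bernoulli model, for which $F$ is the negative entropy, and minimizes the sum of divergences by a plain grid search. The data, the grid resolution and the helper names are hypothetical, and the grid search stands in for any one-dimensional minimizer.

```python
import math

def xlogx(t):
    """t * log(t) with the convention 0 * log(0) = 0."""
    return t * math.log(t) if t > 0 else 0.0

def F(mu):
    """Dual of the Bernoulli log-partition function (negative entropy)."""
    return xlogx(mu) + xlogx(1 - mu)

def grad_F(mu):
    """Gradient of F; only evaluated at interior points 0 < mu < 1."""
    return math.log(mu / (1 - mu))

def bregman_F(x, mu):
    """B_F(x || mu), following equation (2)."""
    return F(x) - F(mu) - grad_F(mu) * (x - mu)

def median_objective(mu, xs):
    """Sum of divergences from each sample to the candidate mu."""
    return sum(bregman_F(x, mu) for x in xs)

xs = [1, 0, 1, 1, 0, 1, 1, 0]                # hypothetical coin flips
grid = [k / 1000 for k in range(1, 1000)]    # interior candidates for mu
mu_grid = min(grid, key=lambda mu: median_objective(mu, xs))

mu_closed = sum(xs) / len(xs)    # the sample mean, the stated solution
theta_ml = grad_F(mu_closed)     # (grad G)^{-1} coincides with grad F

print(mu_grid, mu_closed, theta_ml)
```

On this data the grid minimizer and the sample mean coincide, and mapping the mean through $(\nabla G)^{-1}$ gives the canonical parameter.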

Lemma 1. If $x$ is the sufficient statistic of the exponential family with the log-partition
function $G$, and $F$ is the dual function of $G$ defined over the mean
parameter space $M$, then $x \in M$.

Proof. Refer to [7], p. 39.

Lemma 2. Let $x$ be the sufficient statistic of the exponential family with the
log-partition function $G$, and $F$ be the dual function of $G$ defined over the mean
parameter space $M$; then there exists a $\theta \in \Theta$ such that $x^* = \theta$.

Proof. $M$ and $\Theta$ are duals of each other, so by the definition of duality, for every
$\mu \in M$ there exists a $\theta \in \Theta$ such that $\theta = \mu^*$. From Lemma 1, $x \in M$,
which implies $x^* = \theta$.

Corollary 1 (ML as Bregman Median). Let $G(\theta)$ be the log-partition function
of the exponential family defined over the convex set $\Theta$, $X = \{x_1, \ldots, x_n\}$
be a set of $n$ i.i.d. data points drawn from this exponential family, and $\theta_i$ be the
dual of $x_i$. Then the ML estimate $\theta_{ML}$ of $X$ solves the following
optimization problem:

$$\theta_{ML} = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} B_G(\theta\|\theta_i) \tag{10}$$

Proof. The proof follows directly from Lemma 2 and Theorem 1. From Lemma 2,
we know that $x_i^* = \theta_i$. Now using Theorem 1 and (3), $B_F(x_i\|\mu) = B_G(\theta\|x_i^*) =
B_G(\theta\|\theta_i)$.

The above expression requires us to find a $\theta$ so that the divergence from $\theta$ to the
other $\theta_i$ is minimized. Now note that $G$ is what defines this divergence and
hence the geometry of the space (as discussed earlier in Section 2). Since
$G$ is the log-partition function of the exponential family, it is the log-partition
function that determines the geometry of the space. We emphasize that the
divergence is measured from the parameter being estimated to the other parameters
$\theta_i$, as shown in Figure 3.

Example 1. (1-D Gaussian) The exponential family representation of a 1-D
Gaussian is $p(x; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - a)^2}{2\sigma^2}\right)$ with $\theta = \frac{a}{\sigma^2}$ and $G(\theta) = \frac{\sigma^2 \theta^2}{2}$, whose
ML estimation is just $(\nabla G)^{-1}(\mu) = \frac{\mu}{\sigma^2}$, which gives $a = \mu = \frac{1}{n} \sum_i x_i$, i.e., the data
mean.

Example 2. (1-D Bernoulli) The exponential family representation of a Bernoulli
distribution $p = a^x (1 - a)^{1-x}$ is the distribution with $\theta = \log \frac{a}{1-a}$ and $G(\theta) =
\log(1 + e^\theta)$, whose ML estimation is $(\nabla G)^{-1}(\mu) = \log \frac{\mu}{1-\mu}$. Comparing it with $\theta$
gives $a = \mu = \frac{1}{n} \sum_i x_i$, which is the estimated probability of the event in $n$ trials.

4.2 Conjugate Prior in the form of Bregman Divergence 

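We now give an expression similar to the likelihood for the conjugate prior:

$$\log p(\theta | \alpha, \beta) = \log m(\alpha, \beta) + \beta \left( \left\langle \theta, \frac{\alpha}{\beta} \right\rangle - G(\theta) \right) \tag{11}$$

(11) can be written in the form of Bregman divergence by a direct comparison
to (4), replacing $x$ with $\alpha/\beta$:

$$\log p(\theta | \alpha, \beta) = \log m(\alpha, \beta) + \beta \left( F\left(\frac{\alpha}{\beta}\right) - B_F\left(\frac{\alpha}{\beta} \Big\| \nabla G(\theta)\right) \right) \tag{12}$$

The expression for the joint probability of data and parameters is given by:

$$\log p(x, \theta | \alpha, \beta) = \log p_0(x) + \log m(\alpha, \beta) + F(x) + \beta F\left(\frac{\alpha}{\beta}\right) - \left( B_F(x\|\nabla G(\theta)) + \beta B_F\left(\frac{\alpha}{\beta} \Big\| \nabla G(\theta)\right) \right)$$

Combining all terms that do not depend on $\theta$:

$$\log p(x, \theta | \alpha, \beta) = \text{const} - B_F(x\|\mu) - \beta B_F\left(\frac{\alpha}{\beta} \Big\| \mu\right) \tag{13}$$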
4.3 Geometric Interpretation of Conjugate Prior

In this section we give a geometric interpretation of the term $B_F(x\|\mu) + \beta B_F\left(\frac{\alpha}{\beta}\|\mu\right)$
from (13).

Theorem 2 (MAP as Bregman median). Given a set $X$ of $n$ i.i.d. examples
drawn from the exponential family distribution with the log-partition function $G$
and a conjugate prior as in (12), the MAP estimate of the parameters is $\theta_{MAP} =
\mu_{MAP}^*$, where $\mu_{MAP}$ solves the following problem:

$$\mu_{MAP} = \arg\min_{\mu \in M} \sum_{i=1}^{n} B_F(x_i\|\mu) + \beta B_F\left(\frac{\alpha}{\beta} \Big\| \mu\right) \tag{14}$$

which admits the following solution:

$$\mu_{MAP} = \frac{\sum_{i=1}^{n} x_i + \alpha}{n + \beta} \tag{15}$$

Proof. MAP estimation by definition maximizes (13) for all data points $X$, which
is equivalent to minimizing $\sum_{i=1}^{n} B_F(x_i\|\mu) + \beta B_F\left(\frac{\alpha}{\beta}\|\mu\right)$. One can expand this expression
using (2) and use the conditions $F(\nabla G(\theta)) + G(\theta) = \langle \nabla G(\theta), \theta \rangle$ and $\nabla F(\theta) =
(\nabla G)^{-1}(\theta)$ to obtain the desired solution.

The above solution gives a natural interpretation of MAP estimation. One
can think of the prior as $\beta$ extra points at position $\alpha/\beta$. $\beta$ works as the
effective sample size of the prior.
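The effective-sample reading of the hyperparameters can be illustrated numerically. The following sketch again assumes a Bernoulli model and compares three quantities: the closed-form MAP mean parameter, a grid minimizer of the MAP Bregman median objective, and the plain sample mean of the data augmented with $\beta$ pseudo-points at $\alpha/\beta$. The data and the values of $\alpha$ and $\beta$ are hypothetical, and $\beta$ is taken to be an integer so that the augmentation is literal.

```python
import math

def xlogx(t):
    """t * log(t) with the convention 0 * log(0) = 0."""
    return t * math.log(t) if t > 0 else 0.0

def F(mu):
    """Dual of the Bernoulli log-partition function (negative entropy)."""
    return xlogx(mu) + xlogx(1 - mu)

def grad_F(mu):
    """Gradient of F; only evaluated at interior points 0 < mu < 1."""
    return math.log(mu / (1 - mu))

def bregman_F(x, mu):
    """B_F(x || mu), following equation (2)."""
    return F(x) - F(mu) - grad_F(mu) * (x - mu)

def map_objective(mu, xs, alpha, beta):
    """Data term plus beta copies of the prior point alpha / beta."""
    data_term = sum(bregman_F(x, mu) for x in xs)
    prior_term = beta * bregman_F(alpha / beta, mu)
    return data_term + prior_term

xs = [1, 1, 1, 0]         # hypothetical observations
alpha, beta = 2.0, 4.0    # hypothetical hyperparameters: 4 pseudo-points at 0.5

mu_map = (sum(xs) + alpha) / (len(xs) + beta)    # closed-form solution
grid = [k / 1000 for k in range(1, 1000)]
mu_grid = min(grid, key=lambda mu: map_objective(mu, xs, alpha, beta))

# Plain ML on the data augmented with beta points located at alpha / beta.
augmented = xs + [alpha / beta] * int(beta)
mu_augmented = sum(augmented) / len(augmented)
mu_ml = sum(xs) / len(xs)

print(mu_map, mu_grid, mu_augmented, mu_ml)
```

The first three values coincide, while the ML estimate on the raw data is larger: the prior pulls the estimate toward $\alpha/\beta$ exactly as four additional observations at that position would.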

The expression (14) is analogous to (9) in the sense that both are defined in
the dual space $M$. One can convert (14) into an expression similar to (10) in the
dual space, which is again a Bregman median problem in the parameter space:

$$\theta_{MAP} = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} B_G(\theta\|\theta_i) + \sum_{i=1}^{\beta} B_G\left(\theta \Big\| \left(\frac{\alpha}{\beta}\right)^*\right) \tag{16}$$

here $\left(\frac{\alpha}{\beta}\right)^* \in \Theta$ is the dual of $\frac{\alpha}{\beta}$. The above problem is a Bregman median problem of $n + \beta$
points, $\{\theta_1, \theta_2, \ldots, \theta_n, \underbrace{(\alpha/\beta)^*, \ldots, (\alpha/\beta)^*}_{\beta \text{ times}}\}$, as shown in Figure 3 (left).

A geometric interpretation is also shown in Figure 3. When the prior is conjugate
to the likelihood, they both have the same log-partition function (Figure 3,
left). Therefore they induce the same Bregman divergence. Having the same divergence
means that distances from $\theta$ to $\theta_i$ (in the likelihood) and the distances from
$\theta$ to $(\alpha/\beta)^*$ are measured with the same divergence function, yielding the same
geometry for both spaces.

It is easier to see using the median formulation of the MAP estimation problem
that one must choose a prior that is conjugate. If one chooses a conjugate prior,
then the distances among all points are measured using the same function. It
is also clear from (15) that in the conjugate prior case, the point induced by the
conjugate prior, $(\alpha/\beta)^*$, behaves as a sample point. A median problem over a
space that has different geometries is an ill-formed problem, as discussed further
in the next section.

4.4 Geometric Interpretation of Non-conjugate Prior 

We derived expression (16) because we considered the prior conjugate to the
likelihood function. Had we chosen a non-conjugate prior with log-partition function
$Q$, we would have obtained:

$$\theta_{MAP} = \arg\min_{\theta \in \Theta} \sum_{i=1}^{n} B_G(\theta\|\theta_i) + \sum_{i=1}^{\beta} B_Q\left(\theta \Big\| \left(\frac{\alpha}{\beta}\right)^*\right) \tag{17}$$

Here $G$ and $Q$ are different functions defined over $\Theta$. Since these are the functions
that define the geometry of the parameter space, having $G \neq Q$ is equivalent to
considering them as being defined over different (metric) spaces. Here, it should be
noted that the distance between a sample point $\theta_i$ and the parameter $\theta$ is measured
using the Bregman divergence $B_G$. On the other hand, the distance between
the point induced by the prior, $(\alpha/\beta)^*$, and $\theta$ is measured using the divergence
function $B_Q$. This means that $(\alpha/\beta)^*$ cannot be treated as one of the sample
points. This tells us that, unlike the conjugate case, belief in the non-conjugate
prior cannot be encoded in the form of sample points.

Another problem with considering a non-conjugate prior is that the dual space of
$\Theta$ under different functions would be different. Thus, one will not be able to find
an alternate expression for (17) equivalent to (14), and therefore will not be able to
find a closed-form expression similar to (15). This tells us why a non-conjugate prior
does not give us a closed-form solution for $\theta_{MAP}$.

A pictorial representation of this is also shown in Figure 3. Note that, unlike
the conjugate case, in the non-conjugate case the data likelihood and the prior
belong to different spaces.

5 Information Geometric View 

In this section, we show the appropriateness of the conjugate prior from the
information geometric angle. In information geometry, $\Theta$ is a statistical manifold
such that each $\theta \in \Theta$ defines a probability distribution. This statistical manifold
has an inherent geometry, given by a metric and an affine connection. One
natural metric is the Fisher information metric because of its many attractive
properties: it is Riemannian and is invariant under reparameterization (for more
details refer to [2]).

In exponential family distributions, the Fisher metric $M(\theta)$ is induced by the
KL-divergence $KL(\cdot\|\theta)$, which is equivalent to the Bregman divergence defined
by the log-partition function. Thus, it is the log-partition function $G$ that induces
the Fisher metric, and therefore determines the natural geometry of the space. It
Fig. 3. Prior in the conjugate case has the same geometry as the likelihood while in the
non-conjugate case, they have different geometries. (Panels: "Conjugate" and "Non-conjugate".)



justifies our earlier argument of choosing the log-partition function to define the
geometry. Now if we were to treat the prior as a point on the statistical manifold
defined by the likelihood model, the Fisher information metric at the point given by
the prior must be the same as the one defined on the likelihood manifold. This means
that the prior must have the same log-partition function as the likelihood, i.e., it
must be conjugate.



5.1 Generalization to $\alpha$-affine manifold

Not all probability distributions belong to the exponential family (although many
do). A broader family of distributions is the "$\alpha$-family" [2]. Although a full
treatment of this family is beyond the scope of this work, we briefly discuss an
extension of our results to the $\alpha$-family. An $\alpha$-family distribution is defined as:

$$\log p_\alpha(x; \theta) = \begin{cases} \frac{2}{1-\alpha}\, p(x; \theta)^{(1-\alpha)/2} & \alpha \neq 1 \\ \log p(x; \theta) & \alpha = 1 \end{cases}$$

where $p(x; \theta)$ is defined as in (4). Note that the exponential family is a special
case of the $\alpha$-family for $\alpha = 1$.

MAP estimation of the parameters in the exponential family can be cast as a
median problem, where an appropriate Bregman divergence is used to define the
geometry. In other words, for the exponential family, a Bregman median problem
naturally arose as an estimation method.

By using an appropriately defined "natural" divergence for the $\alpha$-family,
one can actually obtain a similar result for this broader family of distributions.
Using such a natural divergence, one can also define a "conjugate prior" for the
$\alpha$-family. Zhang et al. [12] show that such a natural divergence exists for the $\alpha$-family
and is given by:

$$D_G^{(\alpha)}(\theta_1, \theta_2) = \frac{4}{1-\alpha^2} \left( \frac{1-\alpha}{2} G(\theta_1) + \frac{1+\alpha}{2} G(\theta_2) - G\left( \frac{1-\alpha}{2} \theta_1 + \frac{1+\alpha}{2} \theta_2 \right) \right)$$

As in the exponential family, this divergence also induces the Fisher information
metric.
6 Hybrid model

In this section, we show an application of our analysis to a common supervised
and semi-supervised learning framework. In particular, we consider a
generative/discriminative hybrid model [1, 6, 8] that has been shown to be successful
in many applications.

The hybrid model is a mixture of discriminative and generative models, each
of which has its own separate set of parameters. These two sets of parameters
(hence two models) are combined using a prior called the coupling prior. Let
$p(y|x, \theta_d)$ be the discriminative component, $p(x, y|\theta_g)$ be the generative component
and $p(\theta_d, \theta_g)$ be the prior that couples the discriminative and generative
components. The joint likelihood of the data and parameters is:

$$p(x, y, \theta_d, \theta_g) = p(\theta_g, \theta_d)\, p(y|x, \theta_d)\, p(x|\theta_g) = p(\theta_g, \theta_d)\, p(y|x, \theta_d) \sum_{y'} p(x, y'|\theta_g) \tag{18}$$

Here $\theta_d$ is a set of discriminative parameters, $\theta_g$ a set of generative parameters,
and $p(\theta_g, \theta_d)$ provides the natural coupling between these two sets of parameters.
The most important aspect of this model is the coupling prior $p(\theta_g, \theta_d)$, which
interpolates the hybrid model between two extremes: fully generative when the
prior forces $\theta_d = \theta_g$, and fully discriminative when the prior renders $\theta_d$ and $\theta_g$
independent. In non-extreme cases, the goal of the coupling prior is to encourage
the generative model and the discriminative model to have similar parameters.
It is easy to see that this effect can be induced by many functions. One obvious
way is to linearly interpolate them as done by [8, 6] using a Gaussian prior (or
the Euclidean distance) of the following form:

$$p(\theta_g, \theta_d) \propto \exp(-\lambda \|\theta_g - \theta_d\|^2) \tag{19}$$

where, when $\lambda = 0$, the model is purely discriminative while for $\lambda = \infty$, the model is
purely generative. Thus $\lambda$ in the above expression is the interpolating parameter,
and is the same as the $\gamma$ in Section 2. Note that the log of the prior is nothing but the squared
Euclidean distance between the two sets of parameters, scaled by $-\lambda$ (up to an additive constant).

It has been noted multiple times [4, 1] that a Gaussian prior is not always
appropriate, and the prior should instead be chosen according to the models being
considered. Agarwal et al. [1] suggested using a prior that is conjugate to the
generative model. Their main argument for choosing the conjugate prior came
from the fact that this provides a closed-form solution for the generative parameters
and therefore is mathematically convenient. We will show that it is more
than convenience that makes the conjugate prior appropriate. We show that choosing
a non-conjugate prior is not only inconvenient but also inappropriate.
Moreover, our analysis does not assume anything about the expression and the
hyperparameters of the prior beforehand, but rather derives them automatically.

6.1 Generalized Hybrid Model 

In order to see the effect of the geometry, we first present the generalized hybrid
model for distributions that belong to the exponential family and present them in the
form of Bregman divergence. Following the expression used in [1], the generative
model can be written as:

$$p(x, y|\theta_g) = h(x, y) \exp(\langle \theta_g, T(x, y) \rangle - G(\theta_g)) \tag{20}$$

where $T(\cdot)$ is the potential function similar to $\phi$ in (4), now only defined on
$(x, y)$.

Fig. 4. Parameters $\theta_d$ and $\theta_g$ are interpolated using the Bregman divergence.

Let $G^*$ be the dual function of $G$; the corresponding Bregman divergence
(retaining only the terms that depend on the parameter $\theta$) is given by:

$$B_{G^*}((x, y)\|\nabla G(\theta_g)). \tag{21}$$

Solving the generative model independently reduces to choosing a $\theta_g$ from the
space of all generative parameters $\Theta_g$, which has a geometry defined by the log-partition
function $G$. Similarly to the generative model, the exponential form of the
discriminative model is given as:

$$p(y|x, \theta_d) = \exp(\langle \theta_d, T(x, y) \rangle - M(\theta_d, x)) \tag{22}$$



Importantly, the sufficient statistics $T$ are the same in the generative and
discriminative models; such generative/discriminative pairs occur naturally: logistic
regression/naive Bayes and hidden Markov models/conditional random fields
are examples. However, observe that in the discriminative case, the log-partition
function $M$ depends on both $x$ and $\theta_d$, which makes the analysis of the discriminative
model harder. Unlike the generative model, one does not have an explicit
form of the log-partition function $M$ that is independent of $x$. This means that
the discriminative component (22) cannot be converted into an expression like
(21), and the MLE problem cannot be reduced to a Bregman median problem
like the one given in (10).



6.2 Geometry of the Hybrid Model 

We simplify the analysis of the hybrid model by writing the discriminative model
in an alternate form. This alternate form makes obvious the underlying geometry
of the discriminative model. Note that the only difference between the two
models is that the discriminative model models the conditional distribution while
the generative model models the joint distribution. We can use this observation to
write the discriminative model in the following alternate form using the expression
$p(y|x, \theta) = \frac{p(x, y|\theta)}{\sum_{y'} p(x, y'|\theta)}$ and (20):

$$p(y|x, \theta_d) = \frac{h(x, y) \exp(\langle \theta_d, T(x, y) \rangle - G(\theta_d))}{\sum_{y'} h(x, y') \exp(\langle \theta_d, T(x, y') \rangle - G(\theta_d))} \tag{23}$$

Denote the space of parameters of the discriminative model by $\Theta_d$. It is easy
to see that the geometry of $\Theta_d$ is defined by $G$ since the function $G$ is defined over $\theta_d$.
This is the same as the geometry of the parameter space of the generative model,
$\Theta_g$. Now let us define a new space $\Theta_H$ which is the affine combination of $\Theta_d$ and
$\Theta_g$. Now, $\Theta_H$ will have the same geometry as $\Theta_d$ and $\Theta_g$, i.e., the geometry defined
by $G$. Now the goal of the hybrid model is to find a $\theta \in \Theta_H$ that maximizes
the likelihood of the data under the hybrid model. These two spaces are shown
pictorially in Figure 4.
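The two ways of writing the discriminative model can be compared on a toy example. The sketch below assumes binary labels, $h(x, y) = 1$, and a hypothetical feature map `T` that places a copy of $x$ in the block belonging to label $y$; none of these choices come from the text. It evaluates the conditional both as a ratio of joint terms and in the direct form with the $x$-dependent log-partition function $M(\theta_d, x)$.

```python
import math

LABELS = (0, 1)  # hypothetical binary label set

def T(x, y):
    """Hypothetical sufficient statistics: a copy of x in the block of label y."""
    return [xi if y == label else 0.0 for label in LABELS for xi in x]

def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def conditional(y, x, theta, log_partition=0.0):
    """p(y | x, theta) as a ratio of joint terms, with h(x, y) = 1.

    log_partition plays the role of G(theta); it is passed in as a free
    number because its value does not affect the ratio.
    """
    scores = [math.exp(dot(theta, T(x, label)) - log_partition) for label in LABELS]
    return scores[LABELS.index(y)] / sum(scores)

def M(theta, x):
    """Discriminative log-partition function, which depends on x."""
    return math.log(sum(math.exp(dot(theta, T(x, label))) for label in LABELS))

theta = [0.5, -1.0, 0.2, 0.3]            # hypothetical parameters
x_a, x_b = [1.0, 2.0], [-1.0, 0.5]       # hypothetical inputs

p_without_G = conditional(1, x_a, theta)
p_with_G = conditional(1, x_a, theta, log_partition=3.7)
p_via_M = math.exp(dot(theta, T(x_a, 1)) - M(theta, x_a))

print(p_without_G, p_with_G, p_via_M, M(theta, x_a), M(theta, x_b))
```

In this toy setting the three conditional probabilities agree, since the factor $\exp(-G(\theta_d))$ is common to numerator and denominator, while $M$ takes different values at the two inputs.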

6.3 Prior Selection 

As mentioned earlier, the coupling prior is the most important part of the hybrid
model: it controls the amount of coupling between the generative and
discriminative models. There are many ways to do this, one of which is given by
[8, 6]. By choosing a Gaussian prior as the coupling prior, they implicitly couple
the discriminative and generative parameters by the squared Euclidean distance.
We suggest coupling these two models by a general prior, of which the Gaussian
prior is a special case.

Bregman Divergence and Coupling Prior: Let a general coupling be given
by $B_S(\theta_g\|\theta_d)$. Notice the direction of the divergence. We have chosen this
direction because the prior is induced on the generative parameters, and it is clear
from (16) that the parameters on which the prior is induced are placed in the first
argument of the divergence function. The direction of the divergence is also shown
in Figure 4.
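To make the role of the direction concrete, here is a small sketch (an illustration only, not taken from [8, 6]) of a generic Bregman divergence. It checks that the quadratic generator, half the squared norm, yields the squared Euclidean coupling, and that a non-quadratic generator is not symmetric in its two arguments. The Bernoulli log-partition function used as the second generator, the parameter values, and the function names are all assumptions chosen for the example.

```python
import numpy as np

def bregman(f, grad_f, p, q):
    """B_f(p || q) = f(p) - f(q) - <p - q, grad f(q)>."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return f(p) - f(q) - np.dot(p - q, grad_f(q))

# Quadratic generator: recovers (half) the squared Euclidean distance.
def half_sq(t):
    return 0.5 * np.dot(t, t)

def half_sq_grad(t):
    return t

# Assumed example generator: log-partition of independent Bernoulli variables.
def G(t):
    return np.sum(np.logaddexp(0.0, t))

def grad_G(t):
    return 1.0 / (1.0 + np.exp(-t))

theta_g = np.array([0.0, 1.0])   # illustrative generative parameters
theta_d = np.array([2.0, -1.0])  # illustrative discriminative parameters

euclid = bregman(half_sq, half_sq_grad, theta_g, theta_d)
b_gd = bregman(G, grad_G, theta_g, theta_d)  # generative parameters first
b_dg = bregman(G, grad_G, theta_d, theta_g)  # arguments swapped
```

With the quadratic generator the order of the arguments is irrelevant; with `G` the two orders give different values, so the choice of which parameters go first is a modeling decision.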

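Now we recall the relation (12) between the Bregman divergence and the
prior. Ignoring the function $m$ (this is consumed in the measure defined on the
probability space) and replacing $\nabla G(\theta)$ by $\theta^*$, we get the following expression:

$$\log p(\theta_g|\alpha,\beta) = \beta\left(F\left(\tfrac{\alpha}{\beta}\right) - B_F\left(\tfrac{\alpha}{\beta}\,\Big\|\,\theta_g^*\right)\right) \qquad (24)$$

Now taking $\alpha = \lambda\theta_d^*$ and $\beta = \lambda$, we get:

$$\log p(\theta_g|\lambda\theta_d^*,\lambda) = \lambda\left(F(\theta_d^*) - B_F(\theta_d^*\|\theta_g^*)\right) \qquad (25)$$

$$p(\theta_g|\lambda\theta_d^*,\lambda) = \exp(\lambda F(\theta_d^*))\exp(-\lambda B_F(\theta_d^*\|\theta_g^*)) \qquad (26)$$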
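The identity behind (24)-(26) can be checked numerically. The sketch below does this for a one-dimensional Bernoulli family, which is our own choice of example: G(θ) = log(1 + e^θ), the dual parameter is taken as ∇G(θ), and F is the conjugate dual of G. It compares the log of the unnormalized conjugate prior, ⟨θ_g, α⟩ − βG(θ_g) with α = λ∇G(θ_d) and β = λ, against the Bregman form λ(F(θ_d*) − B_F(θ_d* ‖ θ_g*)). The function names and numerical values are hypothetical.

```python
import numpy as np

def G(theta):
    """Bernoulli log-partition function."""
    return np.logaddexp(0.0, theta)

def grad_G(theta):
    """Maps a natural parameter to its dual (expectation) parameter."""
    return 1.0 / (1.0 + np.exp(-theta))

def F(mu):
    """Conjugate dual of G (negative Bernoulli entropy)."""
    return mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu)

def grad_F(mu):
    return np.log(mu) - np.log(1.0 - mu)

def bregman_F(p, q):
    return F(p) - F(q) - (p - q) * grad_F(q)

def log_prior_natural(theta_g, alpha, beta):
    """Unnormalized log conjugate prior: <theta, alpha> - beta * G(theta)."""
    return theta_g * alpha - beta * G(theta_g)

def log_prior_bregman(theta_g, theta_d, lam):
    """The same quantity written as lam * (F(mu_d) - B_F(mu_d || mu_g))."""
    mu_d, mu_g = grad_G(theta_d), grad_G(theta_g)
    return lam * (F(mu_d) - bregman_F(mu_d, mu_g))

theta_g, theta_d, lam = 0.4, -1.3, 2.5  # illustrative values
lhs = log_prior_natural(theta_g, lam * grad_G(theta_d), lam)
rhs = log_prior_bregman(theta_g, theta_d, lam)
```

The two values agree up to floating-point error for any choice of the three inputs, since the equality is an algebraic identity rather than a property of these particular numbers.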

For the general coupling divergence function $B_S(\theta_g\|\theta_d)$, the corresponding
coupling prior is given by:

$$\exp(-\lambda B_{S^*}(\theta_d^*\|\theta_g^*)) = \exp(-\lambda F(\theta_d^*))\, p(\theta_g|\lambda\theta_d^*,\lambda) \qquad (27)$$

The above relationship between the divergence function (left side of the
expression) and the coupling prior (right side of the expression) allows one to define a
Bregman divergence for a given coupling prior and vice versa.

Coupling Prior for the Hybrid Model: We know that the geometry of
the space underlying the Gaussian prior is just Euclidean, which does not necessarily
match the geometry of the likelihood space. The relationship between prior and
divergence (27) allows one to first define the appropriate geometry for the model,
and then define the prior that respects this geometry. In the hybrid model, this
geometry is given by the log-partition function $G$ of the generative model. This
argument suggests coupling the hybrid model by a divergence of the form
$B_G(\theta_g\|\theta_d)$. The coupling prior corresponding to this divergence function can be
written using (27) as:

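$$\exp(-\lambda B_G(\theta_g\|\theta_d)) = p(\theta_g|\lambda\theta_d^*,\lambda)\exp(-\lambda F(\theta_d^*)) \qquad (28)$$

where $\lambda \in [0, \infty]$ is the interpolation parameter, interpolating between the
discriminative and generative extremes. In dual form, the above expression can
be written as:

$$\exp(-\lambda B_G(\theta_g\|\theta_d)) = p(\theta_g|\lambda\theta_d^*,\lambda)\exp(-\lambda G(\theta_d)). \qquad (29)$$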



Here $\exp(-\lambda G(\theta_d))$ can be thought of as a prior on the discriminative
parameters, $p(\theta_d)$. In the above expression, $\exp(-\lambda B_G(\theta_g\|\theta_d)) = p(\theta_g|\theta_d)\,p(\theta_d)$
behaves as a joint coupling prior $p(\theta_d, \theta_g)$, as originally expected in the model (18).
Note that the hyperparameters of the prior, $\alpha$ and $\beta$, are naturally derived from the
geometric view of the conjugate prior. Here $\alpha = \lambda\theta_d^*$ and $\beta = \lambda$.
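As a hedged illustration of the dual form, the sketch below checks numerically that the divergence generated by G between the generative and discriminative parameters equals the divergence generated by the conjugate dual F between the dual parameters with the arguments swapped, and shows how the coupling term exp(−λ B_G) behaves as λ grows. The Poisson family (G(θ) = e^θ, F(μ) = μ log μ − μ), the parameter values, and the function names are assumptions chosen for the example.

```python
import numpy as np

def G(theta):
    """Poisson log-partition function (assumed example family)."""
    return np.exp(theta)

def grad_G(theta):
    """Dual (expectation) parameter of theta."""
    return np.exp(theta)

def F(mu):
    """Conjugate dual of G."""
    return mu * np.log(mu) - mu

def grad_F(mu):
    return np.log(mu)

def bregman_G(p, q):
    return G(p) - G(q) - (p - q) * grad_G(q)

def bregman_F(p, q):
    return F(p) - F(q) - (p - q) * grad_F(q)

def coupling_weight(theta_g, theta_d, lam):
    """Unnormalized coupling term exp(-lam * B_G(theta_g || theta_d))."""
    return np.exp(-lam * bregman_G(theta_g, theta_d))

theta_g, theta_d = 0.7, -0.2  # illustrative values
primal = bregman_G(theta_g, theta_d)
dual = bregman_F(grad_G(theta_d), grad_G(theta_g))  # arguments swapped in the dual
weights = [coupling_weight(theta_g, theta_d, lam) for lam in (0.0, 1.0, 10.0)]
```

With λ = 0 the weight is 1 for every pair of parameters, so the two sets of parameters are uncoupled; as λ grows, any disagreement between them is penalized more heavily.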

Relation with Agarwal et al.: The prior we derived in the previous section
turns out to be exactly the same as that proposed by Agarwal et al. [1], even
though theirs was not formally justified. In that work, the authors break the
coupled prior $p(\theta_g, \theta_d)$ into two parts: $p(\theta_d)$ and $p(\theta_g|\theta_d)$. They then derive an
expression for $p(\theta_g|\theta_d)$ based on the intuition that the mode of $p(\theta_g|\theta_d)$
should be $\theta_d$. Our analysis takes a different approach by coupling the two models
with a Bregman divergence rather than a prior, and results in the same expression
and hyperparameters for the prior as in [1].

The two analyses diverge here, however. Our analysis derives the
hyperparameters as $\alpha = \lambda(\nabla G)^{-1}(\theta_d)$ and $\beta = \lambda$. However, the expression for the
hyperparameters provided by Agarwal et al. [1] was $\alpha = \nabla G(\theta_d)$ and $\beta = \lambda$.
Their derivation rested on the assumption that the mode of the coupling prior $p(\theta_g|\theta_d)$
should be $\theta_d$. However, in the conjugate prior $p(\theta|\alpha,\beta)$, the mode is $\alpha/\beta$, and
$\alpha/\beta$ behaves as the sufficient statistics for the prior. These terms come
from the data space, not from the parameter space. Therefore the mode of
the coupling prior $p(\theta_g|\theta_d)$ should not be $\theta_d$, but rather the dual of $\theta_d$, which is
$(\nabla G)^{-1}(\theta_d) = \theta_d^*$. Therefore, $\alpha = \lambda\theta_d^*$ and $\beta = \lambda$, and our model gives exactly
this.
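The statement about the mode of the conjugate prior can be checked on a toy case. The sketch below (an illustration under our own assumptions, using a one-dimensional Bernoulli family with G(θ) = log(1 + e^θ)) finds the maximizer of the unnormalized log-prior ⟨θ, α⟩ − βG(θ) by bisection on its derivative, and confirms that ∇G evaluated at the maximizer equals α/β. The values of α and β and the function names are hypothetical.

```python
import math

def grad_G(theta):
    """Gradient of the Bernoulli log-partition function (the logistic sigmoid)."""
    return 1.0 / (1.0 + math.exp(-theta))

def prior_mode(alpha, beta, lo=-30.0, hi=30.0, iters=200):
    """Maximizer of theta * alpha - beta * G(theta), found by bisection.

    The derivative alpha - beta * grad_G(theta) is strictly decreasing in
    theta, so for 0 < alpha / beta < 1 its unique root is the mode.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if alpha - beta * grad_G(mid) > 0:
            lo = mid  # derivative still positive: the mode lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha, beta = 1.5, 5.0            # illustrative hyperparameters
mode = prior_mode(alpha, beta)    # mode in natural coordinates
mean_at_mode = grad_G(mode)       # the same point mapped through grad G
```

In this example the ratio α/β is the mode expressed in the coordinates obtained through ∇G, while the mode in natural coordinates is the corresponding log-odds.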

7 Related Work and Conclusion 

To our knowledge, there have been no previous attempts to understand Bayesian
priors from a geometric perspective. One related piece of work [10] uses the
Bayesian framework to find the best prior for a given distribution. In that work,
the authors use the $\delta$-geometry for the data space and the $\alpha$-geometry
for the prior space, and then show the different cases for different values
of $(\delta, \alpha)$. We emphasize that even though it is possible to use different geometries
for the two spaces, it always makes more sense to use the same geometry. As
mentioned in Remark 1 in [10], useful cases are obtained only when we consider
the same geometry.

We have shown that by considering the geometry induced by a likelihood
function, the natural prior that results is exactly the conjugate prior. We have
used this geometric understanding of the conjugate prior to derive the coupling prior
for the discriminative/generative hybrid model. Our derivation naturally gives
us the expression and the hyperparameters of this coupling prior. As with the hybrid
model, this analysis can be used to give much simpler geometric interpretations
of many models and to extend existing results to other models; e.g.,
we have used this analysis to extend the geometric formulation of the MAP problem
for the exponential family to the $\alpha$-family.